Better Regular Expressions

 

From "The Elements of Good Regex Style", some tips on writing better regular expressions for efficient, precise pattern matching:

  • Whenever Possible, Anchor. – "Anchors, such as the caret ^ for the beginning of a line and the dollar sign $ for the end of a line often provide the needed clue that ensures the engine finds a match in the right place. For instance, when we validate a string, they ensure that the engine matches the whole string, rather than a substring embedded in the string being examined. And anchors often save the engine a lot of backtracking. ..."
  • When You Know what You Want, Say It. When You Know what You Don't Want, Say It Too! – "When you feed your regex engine a lot of .* "dot-star soup", the engine can waste a lot of energy running down the string then backtracking. Be as specific as possible ..."
  • Contrast is Beautiful–Use It. – "When you can, use consecutive tokens that are mutually exclusive in order to create contrast. This reduces backtracking and the need for boundaries ..."
  • Want to Be Lazy? Think Twice. – "... a lazy quantifier has a cost: at each step inside the braces, the engine tries the lazy option first (match no character), then tries to match the next token (the closing brace), then has to backtrack. Therefore, the lazy quantifier causes backtracking at each step ..."
  • A Time for Greed, a Time for Laziness. – "A reluctant (lazy) quantifier can make you feel safe in the knowing that you won't eat more characters than needed and overshoot your match, but since lazy quantifiers cause backtracking at each step, using them can feel like bumping on a country road when you could be rolling down the highway. Likewise, a greedy quantifier may shoot down the string then backtrack all the way back when all you needed was a few nudges with a lazy quantifier."
  • On the Edges: Really Need Boundaries or Delimiters? Use Them–or Make Your Own! – "Most regex engines provide the \b boundary, and sometimes others, which can be useful to inspect an edge of a substring ..."
  • Don't Give Up what You Can Possess. – "Atomic groups (?> ...) and the closely-related possessive quantifiers can save you a lot of backtracking. ..."
  • Don't Match what Splits Easily, and Don't Split what Matches Nicely.
  • Design to Fail.
  • Trust the Dot-Star to Get You to the End of the Line.
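
A few of the tips above (anchoring, saying what you don't want, and using boundaries) can be sketched in Python's re module. This is only an illustration, not code from the quoted article: the pattern names, the is_iso_date helper, and the sample strings are all made up for the example.

```python
import re

# Anchor: validate the whole string, not a substring embedded in it.
# (In Python, "$" also matches just before a trailing newline; "\Z" is stricter.)
UNANCHORED = re.compile(r"\d{4}-\d{2}-\d{2}")
ANCHORED = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_iso_date(text: str) -> bool:
    """Hypothetical validator: True only if the entire string looks like YYYY-MM-DD."""
    return ANCHORED.match(text) is not None


# Say what you don't want: a negated character class instead of dot-star.
GREEDY = re.compile(r"\{(.*)\}")        # runs to the end of the string, then backtracks
LAZY = re.compile(r"\{(.*?)\}")         # tests for the closing brace at every step
NEGATED = re.compile(r"\{([^{}]*)\}")   # stops at the first brace by construction

# Boundaries: match "cat" as a whole word only.
WHOLE_WORD = re.compile(r"\bcat\b")

sample = "x {alpha} y {beta} z"
greedy_hits = GREEDY.findall(sample)
lazy_hits = LAZY.findall(sample)
negated_hits = NEGATED.findall(sample)
word_hits = WHOLE_WORD.findall("cat concat cats")

if __name__ == "__main__":
    print(is_iso_date("2020-10-31"), is_iso_date("on 2020-10-31 at noon"))
    print(greedy_hits, lazy_hits, negated_hits, word_hits)
```

On this sample the greedy pattern overshoots into the second pair of braces, while the lazy and negated-class patterns both return the two brace contents; the negated class gets there without relying on the lazy quantifier's step-by-step checks.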

(cf Fuzzy Proximity (2000-03-18), Snip Pattern (2001-09-06), Reg Explanations (2003-12-03), Regex vs HTML (2020-02-27), Logical AND for Regex (2020-10-28), ...) - ^z - 2020-10-31